Modernize the threaded assembly howto #1068

Merged · 1 commit · Sep 26, 2024
Conversation

@fredrikekre (Member)

No description provided.

codecov bot commented Sep 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.73%. Comparing base (c04c5ba) to head (0e9306c).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1068   +/-   ##
=======================================
  Coverage   93.72%   93.73%           
=======================================
  Files          39       39           
  Lines        6011     6017    +6     
=======================================
+ Hits         5634     5640    +6     
  Misses        377      377           


@termi-official (Member)

I get NaNs every now and then:

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.163506 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.159455 seconds (141.79 k allocations: 20.707 MiB, 0.01% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.158596 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(NaN, NaN)

@fredrikekre (Member Author)

Try after fa2f1d2? Curious that it didn't always give trash data though...

@termi-official (Member)

Saw the fix, and I am also very confused that it worked in the first place (or why it only failed occasionally).

Do you have access to some cluster node (or @KnutAM?) to check scaling? Or some other machine built to scale in the number of threads? Ideally with frequency scaling turned off. We should also check pinning threads to cores so we can give some recommendations here.
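
For reference, pinning can be done with ThreadPinning.jl; a minimal sketch (the pinthreads(:cores) call is the same one used for the benchmark further down in this thread):

using ThreadPinning

# Pin each Julia thread to its own physical core so that timings are not
# distorted by thread migration or by two threads sharing one core.
pinthreads(:cores)

# Optional: inspect the resulting thread-to-core mapping.
threadinfo()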

@termi-official (Member)

The issue remains on head:

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.013251 seconds (10.54 k allocations: 5.768 MiB, 0.04% compilation time)
(NaN, NaN)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.005826 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(4.760410118115698e12, 0.25328245103046515)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.006254 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(4.760410118115698e12, 0.25328245103046515)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.008274 seconds (10.54 k allocations: 5.768 MiB, 0.06% compilation time)
(NaN, NaN)

julia> nK4, nf4 = main(; n = 5, ntasks = 8) #src
  0.005915 seconds (10.54 k allocations: 5.768 MiB, 0.08% compilation time)
(NaN, NaN)

It seems to only happen with 8 and 16 threads though (on an 8-core machine with hyperthreading).

@KnutAM linked an issue on Sep 25, 2024 that may be closed by this pull request.
@KnutAM (Member) commented on Sep 25, 2024

Great to get this issue fixed!
Just peeked, and noticed an overwrite of cellvalues:

            @local scratch = ScratchData(dh, K, f, cellvalues)
            (; cell_cache, cellvalues, Ke, fe, assembler) = scratch
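
Presumably the destructuring rebinds cellvalues, so the variable captured by the @local initializer is shared between tasks and two tasks can end up using the same CellValues object. A minimal sketch of the fix, assuming the cellvalues_tmp rename that appears later in this diff:

            ## Give the shared template a distinct name so that unpacking the
            ## scratch cannot rebind the variable captured by `@local`:
            @local scratch = ScratchData(dh, K, f, cellvalues_tmp)
            (; cell_cache, cellvalues, Ke, fe, assembler) = scratch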

@fredrikekre (Member Author)

Ah, nice find. That probably explains the large number of allocations too.

@termi-official (Member)

The NaN issue seems to be gone now. However, it only removed about a third of the allocations.

@fredrikekre (Member Author)

For me it reduced them significantly (see the last commit).

@lijas (Collaborator) left a comment

These are my timings on the cluster with n=30:

Run with nthreads: 1
  8.996648 seconds (902 allocations: 816.172 KiB, 0.00% compilation time)
-----
Run with nthreads: 2
  4.957708 seconds (1.64 k allocations: 1.564 MiB, 0.00% compilation time)
-----
Run with nthreads: 4
  2.547859 seconds (3.12 k allocations: 3.099 MiB, 0.00% compilation time)
-----
Run with nthreads: 8
  1.555725 seconds (6.07 k allocations: 6.168 MiB, 0.00% compilation time)
-----
Run with nthreads: 16
  0.951255 seconds (11.97 k allocations: 12.306 MiB, 0.00% compilation time)
-----
Run with nthreads: 32
  0.614075 seconds (23.78 k allocations: 24.582 MiB, 0.01% compilation time)
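
For reference, that is a speedup of roughly 9.0 s / 0.61 s ≈ 14.7× at 32 threads, i.e. about 46% parallel efficiency.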

cell_cache = CellCache(dh)
n = ndofs_per_cell(dh)
Ke = zeros(n, n)
fe = zeros(n)
@lijas (Collaborator)

Previously we allocated the scratch data in a different way: fes = [zeros(n_basefuncs) for i in 1:nthreads] (to avoid cache misses, I believe). Is that not needed anymore? This is definitely cleaner.

@fredrikekre (Member Author)

This constructor will be called once per task, yes.

OhMyThreads.@tasks for cellidx in color
    @set scheduler = scheduler
    ## Obtain a task local scratch and unpack it
    @local scratch = ScratchData(dh, K, f, cellvalues_tmp)
@lijas (Collaborator)

Is the scratch data only created once per task? It does not look like it, since it is inside the loop, but maybe that is what the macro is for?
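
For context, a minimal toy sketch of the @local semantics (hypothetical example, not the howto's code): the initializer runs once per task, not once per iteration:

using OhMyThreads: @tasks, @local

@tasks for i in 1:1000
    ## `@local` hoists the initializer into task-local storage: `zeros(3)`
    ## runs once per task (not once per iteration), and each task then
    ## reuses its own buffer without sharing it with other tasks.
    @local buf = zeros(3)
    buf .= i
end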

@fredrikekre (Member Author)

On

Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 9354 32-Core Processor

with n = 40 I get

julia> data
8-element Vector{Pair{Int64, Float64}}:
   1 => 17.816107
   2 => 8.693293
   4 => 4.78457
   8 => 2.882439
  16 => 1.754894
  32 => 1.330953
  64 => 1.08486
 128 => 0.708531

julia> p = UnicodePlots.lineplot(first.(data), last.(data); xscale = :log2, yscale = :log2);

julia> UnicodePlots.lineplot!(p, first.(data), last(first(data)) ./ first.(data))
[UnicodePlots output: measured wall time vs. ideal linear scaling, log₂–log₂ axes, 1–128 threads]

@fredrikekre (Member Author)

With threadpinning (pinthreads(:cores)):

julia> data
8-element Vector{Pair{Int64, Float64}}:
   1 => 16.316607
   2 => 8.182938
   4 => 4.095896
   8 => 2.060625
  16 => 1.051968
  32 => 0.607133
  64 => 0.354834
 128 => 0.42452

julia> p = lineplot(first.(data), last.(data); xscale = :log2, yscale = :log2);

julia> lineplot!(p, first.(data), last(first(data)) ./ first.(data))
[UnicodePlots output: measured wall time vs. ideal linear scaling, log₂–log₂ axes, 1–128 threads]

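For comparison: with pinning, the best time is 0.355 s at 64 threads, roughly a 16.3 / 0.355 ≈ 46× speedup (about 72% parallel efficiency), whereas the unpinned run above reached only ≈25× at 128 threads. Note also the regression from 64 to 128 pinned threads (0.355 s → 0.425 s).
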
This patch rewrites the threaded assembly howto to use OhMyThreads.jl, which provides a better interface to multithreading than "raw" `@threads`. It also adds some more prose and explanations.
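
For illustration, a minimal hypothetical sketch of the change in style (toy code, not the howto's actual assembly loop):

using Base.Threads: @threads, threadid, nthreads
using OhMyThreads: @tasks, @set, @local

## "Raw" @threads: per-thread buffers indexed by threadid(), a pattern
## that is only safe with the :static scheduler and that ties the code
## to the number of threads.
buffers = [zeros(3) for _ in 1:nthreads()]
@threads :static for i in 1:100
    buffers[threadid()] .= i
end

## OhMyThreads: task-local buffers and an explicit scheduler choice.
@tasks for i in 1:100
    @set scheduler = :static
    @local buf = zeros(3)
    buf .= i
end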

Successfully merging this pull request may close these issues.

Update threading example
5 participants